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ABSTRACT 


Entities  cooperating  as  a  group  become  simpler  to  construct  if  they  possess  access  to  a  member¬ 
ship  service  to  manage  and  administer  the  membership  information  of  such  groups.  This  report  describes  the 
architecture  and  design  of  a  wide-area  group  membership  service.  Unlike  any  known  membership  service, 
the  service  is  based  on  a  completely  decentralized  protocol  executed  by  a  hierarchy  of  servers.  This  hierar¬ 
chy  permits  a  clear  separation  between  the  membership  service  infrastructure  and  support  for  application 
groups,  permitting  global  scaleability.  The  membership  protocol  itself  is  executed  by  a  core  set  of  member¬ 
ship  servers  identified  in  a  group-specific  manner,  permitting  a  separate  name  space,  membership  scope  and 
partition  handling  for  each  group.  We  describe  a  suitable  application  programmer's  interface  and  provide 
correctness  arguments  for  the  protocol.  A  working  implementation  of  the  basic  membership  protocol  is 
described. 
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I.  INTRODUCTION 


Construction  of  distributed  applications  that  require  co-operation  among  a  group  of  entities  is  fa¬ 
cilitated  by  a  variety  of  network-based  services  which  provide  clock  synchronization,  wide-area  file  systems, 
naming  and  directory  information,  and  secure  remote  access.  Such  applications  are  typically  also  required  to 
manage  dynamic  groups  of  entities,  such  as  processes,  devices  etc.,  for  which  the  membership  changes  due 
to  voluntary,  as  well  as  involuntary,  actions.  As  such  applications  proliferate  on  a  typical  organization's  wide 
area  network,  access  to  a  membership  service  (MS)  to  manage  and  administer  the  membership  information 
of  individual  groups  becomes  necessary.  As  shown  in  Figure  1,  each  group's  members  can  access  the  MS 
provided  by  membership  servers  (mservers)  spread  over  an  a  wide  area  network  (WAN)  supporting 
independent,  nested  and  overlapped  groups. 


Figure  1 :  Membership  Service  and  Client  Application  Groups 

The  type  of  membership  information  required  depends  upon  the  nature  of  the  cooperation  to  be 
achieved  by  the  members  of  the  client  groups.  Examples  of  membership-related  information  are  group  size, 
members'  identities,  their  geographical  and  organizational  distribution,  and  a  history  of  membership  changes. 

A  membership  service  has  typically  been  provided  as  an  embedded  component  of  toolkits  for  build¬ 
ing  distributed  applications.  Table  1  provides  a  summary  and  comparison  of  the  known  protocols  for  both 
synchronous  and  asynchronous  environments  When  the  maximum  delay  faced  by  a  message  can  be  guaran¬ 
teed,  the  environment  is  called  synchronous.  Such  a  guarantee  cannot  be  provided  in  asynchronous  environ¬ 
ments  in  which  a  message  may  be  delayed  arbitrarily.  The  Internet  represents  an  asynchronous  environment. 
We  note  that  many  of  the  protocols  listed  in  Table  1  make  unique  assumptions  about  the  communication 
environment. 

The  main  motivation  for  this  work  is  that  none  of  the  existing  approaches  takes  the  view  that  a 
membership  service  is  useful  as  an  internetwork-wide  service  much  like  the  name/directory  services  such  as 
the  Domain  Name  Service  or  X.500.  In  each  of  the  cases,  the  protocol  lacks  one  or  more  of  the  basic  fea¬ 
tures  of  scaleability  over  WANs,  portion  of  a  range  of  membership  services,  and,  in  many  cases,  assumes 
network  properties  that  are  not  representative  of  today's  wide-area  networks. 
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Therefore,  the  MS  described  here,  assumes  a  "best-effort"  network  such  as  the  Internet,  is  scale- 
able  with  respect  to  the  size/distribution/number  of  groups,  uses  network-level  multicasting  when  available, 
employs  a  decentralized  protocol  with  provably  minimal  number  of  phases  for  committing  changes,  and  of¬ 
fers  different  qualities-of-service(QoS). 

Table  1 ;  A  Summary  of  Existing  Membership  Protocols 
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This  report  describes  a  wide-area  membership  service  to  facilitate  the  creation  and  management  of 
large,  widely  distributed  groups  for  which  different  types  of  membership-related  information  must  be  main¬ 
tained.  Figure  2  shows  the  relation  of  the  MS  protocols  to  TCP/IP  suite  of  internetworking  protocols. 

Section  II -describes  the  MS  architecture  and  its  discrete  components.  Section  III  defines  and  ex¬ 
plains  the  protocols  used  by  MS  and  Section  IV  describes  the  implementation  of  the  core  membership 
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protocol.  In  Section  V  we  give  a  more  formal  description  of  the  protocol  and  provide  correctness  argu¬ 
ments.  We  conclude  wath  Section  VI. 

Application  and  Upper-Layer 
Protocol  Modules 


Membership  Seivice  Interface 

Member  Interface  IMF) 

Membership  Service 
Module 


Datagram  Network  Interface 


IP  Multicast 

multicast 

emulator 

User  Datagram  Protoco 

Figure  2  .  Position  of  MS  in  the  Communication  Protocol  Hierarchy 

II,  MEMBERSHIP  SERVICE  ARCHITECTURE 

The  key  to  a  scaleable  MS  is  a  decentralized,  hierarchical  architecture,  designed  to  exploit  the  exist¬ 
ing  physical  topology  of  subnetworks,  networks,  and  internetworks  upon  which  the  distributed  application 
process  groups  that  the  MS  supports  will  be  running.  This  section  describes  the  structure  and  composition 
of  the  physical  hierarchy  of  the  MS  and  how  this  architecture  supports  application  process  groups. 

A.  COMPONENTS  AND  THEIR  FUNCTIONS 

1.  Mservers  And  Member  Interfaces 

The  MS  is  comprised  of  two  primary  entities;  membership  servers  (mservers)  organized  in  a  physi¬ 
cal  tree  hierarchy  and  member  interfaces  (MI)  that  represent  the  leaves  of  the  tree.  The  mservers  are  pri¬ 
marily  responsible  for  processing  changes  and  providing  information  to  the  members  of  the  physical 
hierarchy  as  well  as  the  application  process  groups  using  the  MS.  Application  group  processes  interface  with 
the  MS  through  an  MI  process  running  on  each  host  computer.  Each  MI  accepts  requests  for  changes  to  or 
information  about  application  groups  from  the  individual  application  member  processes  running  on  the  par¬ 
ticular  host  computer.  The  MI  then  reliably  relays  these  requests  to  the  LAN  mserver  to  access  the  MS. 
The  MI  receives  responses  from  the  LAN  mserver  and  reliably  propagates  these  responses  to  the  application 
member  processes  that  it  supports.  Each  MI  supports  numerous  application  groups  and  numerous  individual 
member  processes  from  each  application  group. 

Figure  3  illustrates  an  example  logical  hierarchy  of  mservers,  Mis,  and  application  group  processes. 
The  architecture  shown  is  a  representative  configuration  for  a  small  area  encompassing  a  single  institution, 
such  as  a  campus  or  business.  The  MI  process  remains  resident  on  the  host  computer  even  if  no  applications 
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are  running  to  provide  quick  access  to  the  MS.  The  logical  hierarchy  shown  in  Figure  3  corresponds  to  the 
physical  topology  of  networks  and  computers.  It  shows  11  departmental  LANs  served  by  as  many  mservers. 
The  1 1  mservers  form  3  groups  at  the  building  level.  At  the  next  level  (top)  3  mservers  form  a  group  to 
serve  the  entire  campus.  This  hierarchy  of  mserver  groups  forms  the  MS  infrastructure.  Figure  4  shows  the 
messages  that  are  exchanged  between  LAN  mservers,  Member  Interfaces  and  application  group  members. 


Backbone 


m  Application  ^...Application  1 

Procestes  \uy--Appitcatton  2 
■  Ml  per  host  per  host  \^.  . Application  S 


Figure  3;  Logical  MS  Hierarchy 

The  mservers  and  Mis  are  administratively  configured  into  the  desired  physical  hierarchy  by  a  local 
system  administrator  or  cognizant  authority.  This  configuration  is  expected  to  be  semi-static,  normally 
changing  only  when  additions  and  deletions  to  the  physical  topology  are  made.  The  system  administrator 
will  assign  appropriate  names  for  each  set  of  mservers  at  each  level.  If  network-level  multicasting  is  avail¬ 
able,  will  join  each  set  into  a  multicast  group  for  efficient  communication.  The  assignment  of  a  set  name  and 
multicast  address  is  accomplished  when  the  set  of  mservers  is  created  and  joined  together,  using  software 
calls  between  the  MS  and  the  kernel. 

2.  Failures,  Partitions,  And  Dynamic  Reformation 

Mserver  failures  and  network  partitions  lead  to  a  dynamic  reconfiguration  of  the  physical  structure 
of  the  MS,  with  the  surviving  mservers  and  Mis  automatically  reforming  into  partitioned  sets.  Perceived 
mserver  failures  represent  virtual  partitioning  of  the  network  into  one  or  more  partitioned  subsets  of  the 
original  set  of  mservers.  Each  partitioned  subset  corresponds  to  the  subtree  of  the  physical  hierarchy  in  a 
single  piece  of  the  partition.  A  piece  of  a  partition  refers  to  all  of  the  mservers  which  are  still  able  to  commu¬ 
nicate  over  the  non-partitioned  network.  Each  partition  of  the  MS  reforms  and  continues  to  function,  pro¬ 
viding  service  to  all  application  process  groups  with  all  members  still  existing  in  the  partition.  The 
application  process  groups  which  span  the  partitioned  network  will  experience  a  partition  in  their  member¬ 
ship.  This  condition  will  continue  until  the  physical  network  partition  is  repaired,  at  which  time  the  physical 
hierarchy  of  mservers  will  either  administratively  or  automatically  be  reformed  to  the  original  configuration. 
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Once  the  physical  hierarchy  is  restored,  the  surviving  application  groups  will  also  be  reformed,  as  per  the 
QoS  related  to  partition  resolution  chosen  by  the  MS  user  at  start  up  time.  Mservers  may  detect  fail- 
rues/partitions  of  the  MS  by  continuously  monitoring  other  mservers  in  the  manner  described  later. 


Figure  4:  Messages  sent  by  LAN  Members  and  Member  Interfaces 

3.  Change-Processing  Core-Set 

The  group  of  mservers  at  each  level  in  the  hierarchy  is  called  a  change-processing  core-set,  with  re¬ 
spect  to  a  particular  application  group,  when  it  is  designated  to  be  responsible  for  processing  all  membership 
change  requests  submitted  by  members  of  that  application  group.  Every  such  set  is  also  responsible  for  en¬ 
acting  changes  in  the  physical  hierarchy  immediately  below  it.  The  change  processing  involves  reaching 
agreement  amongst  all  mservers  in  the  core-set  about  the  change  submitted  and  propagating  this  change 
back  to  the  application  or  physical  hierarchy  group  members,  who  are  then  guaranteed  to  have  a  new  view 
of  the  changed  group  membership 

For  example,  the  set  of  mservers  labeled  backbone  in  Figure  3  is  the  core-set  for  the  application  1 
group,  since  this  application  group  has  members  distributed  on  all  LANs.  The  sets  of  mservers  at  lower  lev¬ 
els  in  the  hierarchy  will  not  process  membership  changes  for  application  1 ,  but  will  submit  them  to  the  core¬ 
set  backbone,  then  relay  the  results  back  to  the  Mis.  Similarly  the  mserver  group  labeled  building  2  serves 
application  2  (■).  Also  if  the  mserver  for  department  3  LAN  fails,  the  mserver  group  labeled  building  1  will 
process  the  change  to  the  physical  hierarchy. 

For  each  membership  change  request  submitted  to  an  mserver  group,  a  coordinator  is  chosen.  The 
criteria  for  selecting  the  coordinator  depends  on  the  particular  type  of  change  and  how  it  was  submitted  to 
or  detected  by  the  mserver  group.  The  fact  that  the  coordinator  is  not  a  fixed  member  of  the  mserver  group, 
but  instead  varies  from  change  to  change,  is  a  powerful  feature  of  the  MS. 

4.  Lan  Mserver  Monitoring 

Due  to  the  high  bandwidth,  low  latency,  hardware  multicast  capability,  and  limited  number  of  Mis 
to  monitor,  the  mserver  representing  each  LAN  uses  a  simple  polling  scheme  to  conduct  status  monitoring 


7 


of  the  Mis  on  the  LAN.  Each  MI  on  the  LAN  is  successively  polled  with  a  Query  message  by  the  LAN 
mserver.  The  MI  responds  with  a  Reply  message  indicating  normal  status.  Timeouts  and  retries  are  used  to 
detect  a  non-responding  MI  and  announce  the  perceived  failure.  Note  that  this  monitoring  emphasizes 
collection  of  status  of  individual  MTs  on  the  LAN.  This  is  to  be  distinguished  from  the  monitoring  done  by 
IGMP  which  detects  if  there  exists  a  member  (any)  on  the  LAN  [6]. 

5,  Forming  The  Hierarchy 

The  final  organization  of  mservers  and  Mis  involves  forming  the  hierarchy  of  the  sets  of  mservers 
that  cooperate  for  monitoring  and  change  processing,  with  the  Mis  at  the  leaf  level.  As  shown  in  Figure  3, 
each  mserver  in  the  hierarchy  has  either  a  set  of  children  mservers  or  Mis.  All  mservers  and  Mis  also  have  a 
parent  mserver,  except  the  mservers  at  the  highest  level  of  the  hierarchy.  Each  mserver  above  the  lowest  lev¬ 
el  in  the  hierarchy  has  a  dual  membership  in  the  "child-set"  as  well  as  the  original  core-set  of  mservers. 

Having  the  parent  mserver  as  a  member  of  the  child-set  has  two  primary  advantages.  First,  the  par¬ 
ent  mserver  is  part  of  monitoring  the  child  set,  thus,  the  child-set  will  immediately  learn  of  the  failure  of  the 
parent  mserver  by  monitoring.  Second,  the  parent  mserver  takes  part  in  all  change  processing  conducted  by 
the  child-set;  therefore,  it  will  learn  of  any  changes  in  the  membership  of  the  child-set  directly.  Together, 
these  two  points  ensure  that  "vertical  monitoring"  is  conducted  in  the  hierarchy.  This  provides  the  means  to 
ensure  that  a  failure  or  partition  between  levels  in  the  MS  hierarchy  will  be  detected,  allowing  the  MS  to  re¬ 
form  as  necessary. 

B.  SUPPORT  FOR  APPLICATION  GROUPS 

The  MS  is  responsible  for  managing  the  membership  of  the  application  groups  and  providing  ser¬ 
vices  to  the  application  groups  with  features  as  described  below 

1.  Consistency 

The  primary  service  that  the  MS  provides  application  groups  is  a  consistent  view  of  the  group 
membership  at  all  members,  as  well  as  a  consistent  ordering  of  changes  to  the  membership  of  the  group  at  all 
members.  These  consistency  guarantees  ensure  that  a  group  member  either  acquires  the  same  consistent 
view  as  all  other  members  of  the  group  eventually,  or  is  excluded  from  the  membership  of  the  group.  The 
term  "eventually"  refers  to  the  asynchronous  nature  of  the  environment,  leading  to  delays  at  some  sites.  Us¬ 
ing  this  guarantee  of  consistent  membership  at  all  members,  the  application  can  expect  that  members  with 
the  same  group  view  number  have  seen  the  same  sequence  of  membership  changes  and  have  the  same  view 
of  the  membership  of  the  group.  Using  this  knowledge,  the  application  can  decide  to  accept  or  reject  mes¬ 
sages  from  other  application  processes  depending  on  the  included  group  view  number  [11,  27].  The  guaran¬ 
tee  of  consistent  membership  can  be  used  as  the  foundation  upon  which  to  build  many  distributed 
applications. 
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The  MS  provides  consistent  ordering  of  membership  changes  to  application  groups  by  ensuring  that 
only  one  change  is  ever  processed  at  a  time  in  the  core-set  of  that  application  group,  and  that  all  active  mem¬ 
ber  processes  eventually  receive  this  change.  The  selected  change  is  committed  by  all  core-set  mservers, 
then  reliably  propagated  to  the  Mis,  and  finally,  to  the  distributed  application  member  processes.  The  MS 
provides  the  guarantee  that  an  application  member  process  either  receives  each  revised  group  view  or  is  de¬ 
tected  as  failed,  and  excluded  from  the  group.  In  this  manner,  all  surviving  application  member  processes 
are  guaranteed  to  have  exactly  the  same  ordering  of  membership  changes. 

2.  Naming 

The  MS  manages  the  names  of  all  application  groups  using  the  MS.  Application  group  names  are 
guaranteed  unique  within  a  predetermined  scope.  When  an  application  group  is  created,  the  software  call 
from  the  application  to  the  MS  includes  as  a  parameter  a  level  in  the  MS  physical  hierarchy,  under  which  the 
application  group  name  will  be  guaranteed  unique.  This  name-scope  parameter  is  either  the  actual  name  of 
the  core-set  or  a  level  number  above  the  MI  level  in  the  physical  hierarchy.  For  example,  to  guarantee  an 
application  group  name  of  "application!"  as  unique  under  the  scope  of  the  backbone  core-set  from  Figure  3, 
the  name  backbone  or  the  level  number  2  would  be  used  as  the  name-scope  parameter.  The  name-scope 
level  must  be  at  or  above  the  core-set  level  for  the  application. 

With  the  creation  of  each  new  application  group,  the  name-scope  parameter  is  checked  at  each  lev¬ 
el  in  the  mserver  hierarchy  up  to  and  including  the  name-scope  level.  If  the  name  already  exists,  the  creation 
of  the  new  group  is  refused,  and  an  error  code  is  returned  to  the  calling  application.  If  the  name  is  not 
found,  then  it  is  registered  at  the  name-scope  level  of  mservers  and  a  successful  group  creation  is  reported  to 
the  calling  application.  When  new  application  members  at  distributed  locations  wish  to  join  an  existing  ap¬ 
plication  group,  a  join  request  is  submitted  via  the  resident  MI,  then  propagated  up  the  hierarchy  until  either 
an  mserver  is  located  with  the  application  name  stored  or  the  highest  level  in  the  physical  hierarchy  is 
reached  and  the  application  name  is  not  located  If  the  desired  application  group  name  is  located,  the  new 
member  is  joined  into  the  application  group  through  the  normal  change-processing  sequence,  and  a  succes¬ 
sful  join  is  reported  back  to  the  requesting  process.  If  the  name  is  not  located,  an  unsuccessful  join  attempt 
is  reported  back.  Through  judicious  use  of  the  name-scope  parameter,  application  names  may  be  used  freely 
with  little  concern  about  duplicate  name  usage. 

3.  Membership  Scope  Control 

An  additional  feature  provided  by  the  MS  is  the  ability  for  an  application  to  decide  at  what  level  in 
the  MS  physical  hierarchy  to  limit  the  scope  of  the  application  group.  By  providing  a  membership-scope  pa¬ 
rameter  with  the  creation  call  for  a  new  application  group,  the  application  guarantees  that  the  span  of  the 
application's  membership  will  not  exceed  that  of  the  given  core-set  level  in  the  physical  hierarchy.  In  return, 
the  MS  is  able  to  provide  more  efficient  service  by  limiting  the  scope  of  application  group  name  searches  to 
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the  membership-scope  level  and  below.  Instead  of  propagating  every  unsuccessful  application  group  name 
search  to  the  highest  level  of  the  MS  hierarchy,  the  name  search  will  cease  at  the  membership-scope  level. 
Without  use  of  the  membership-scope,  it  might  be  possible  for  a  bottleneck  to  form  at  the  "top"  of  the  MS 
hierarchy. 

4.  Member  Interfaces 

The  Mi's  accept  application  membership  change  and  information  requests  from  application  pro¬ 
cesses  and  submit  these  changes  to  the  mserver  hierarchy  for  processing.  When  the  change  or  information 
data  is  returned,  the  MI  passes  the  data  to  the  requesting  member  processes. 

The  MI,  running  on  an  individual  host  computer  is  capable  of  interfacing  multiple  application 
groups,  each  with  multiple  members,  with  the  LAN  mserver  and  maintains  a  list  of  all  application  groups  it  is 
managing  as  well  as  all  member  processes  from  these  groups  running  on  the  host  computer.  Thus,  the  mem¬ 
bership  information  for  each  application  group  is  maintained  in  a  decentralized,  scaleable  manner.  When  an 
application  member  process  needs  to  communicate  with  another  application  member  process  on  a  different 
host,  it  submits  a  request  for  addressing  information  to  the  MI.  The  MI  relays  this  information  request  to 
the  MS,  which  obtains  the  desired  information  from  the  Ml  managing  the  desired  member  process,  and  re¬ 
lays  the  information  back  to  the  requesting  MI  and  application  member  process. 

5.  Application  Group  Change  Processing 

As  previously  discussed,  application  group  change  processing  begins  with  the  submission  of  a 
change  request  to  the  host  MI.  This  request  is  relayed  to  the  core-set  of  the  application,  which  conducts  the 
mserver  change-processing  procedure,  resulting  in  all  core-set  mservers  committing  the  change.  Each  core¬ 
set  mserver  then  reliably  relays  the  change  directive  down  the  hierarchy  to  the  MI,  and  then  to  the  requesting 
application  process.  Timeouts  and  retries  are  again  used  to  detect  failures  and  partitions. 

C.  APPLICATION  INTERFACE  WITH  THE  MS 

Table  2  summarizes  the  calls  to  the  MS  that  provide  membership  service  to  the  clients.  In  addition 
to  these  explicitly  requested  actions,  the  MS  takes  appropriate  actions  when  failures  and/or  partitions  occur 
and  provides  notifications  to  all  remaining  members. 

HI.  PROTOCOL  DESCRIPTIONS 

A.  PROTOCOL  FUNCTIONS 

The  basic  change-processing  protocol  uses  a  modified  form  of  the  three-way  handshake  often  seen 
in  unreliable  networks  for  reliable  message  delivery.  The  coordinator  initiates  the  change  processing  vvdth  a 
multicast  to  all  group  mservers,  collects  acknowledgment  (ACK)  messages  from  all,  then  multicasts  a  final 
message  for  all  to  commit  the  change.  Timeouts  and  retries  are  used  by  mservers  waiting  to  receive  ACKi 
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or  Commit  messages  from  other  mservers  to  ensure  that  continual  progress  is  made  toward  completion  of 
the  change.  As  with  the  monitoring  scheme,  if  the  expected  reply  is  not  received  from  an  mserver  after  the 
timeout  period  and  all  successive  retries,  then  that  mserver  is  declared  failed  and  the  failure  is  announced  to 
all  other  mservers  in  the  group. 

Table  2:  Application  Interface  Calls 


CATEGORY 

NAME  OF  CALL 

DESCRIPTION 

i;reate_groiip  (char  •groiq>_name,  char  ‘properties) 

Creates  and  registers  v^T!h  the  msovaR  a  Wth  ^  gi\ai  name 

with  null  membership.  pEsmhs  tiie  creation  of  a  subgjxx^. 

setjgroup_attrflHites  (char  ‘groi:^,  char  ‘properties,  _): 

Sets  the  attributes  of  a  gioip  such  as  name-scope,  membershif 
scope  etc. 

1 

deletc_graup  (char  ‘graiq>_name,  _): 

Cleans  ip  a  group,  forcing  all  participating  membas  to  leave. 

se<_incmbcr_atlrfl)UlBS  (char  ‘gna^,  char  ‘address,  char 
‘properties,  _) 

Sets  the  attributes  fer  the  specified  member  of  a  group  such  as 
access  rights. 

split _group  (char ‘paraitjgroup,  char ‘cWkl_l,  char ‘chfld_2,_) 

Creates  tv<o  subgroups,  by  spliling  the  parent  group. 

met^_grDup  (char  *group_l,  char  •group_2,  char  •new_^rDi^, 

Creates  a  new  group  by  merging  two  old  groups. 

is_inembcr  (char  ‘address,  char  ‘groigi,  \ievs  ‘siessid,  _) 

Retijms  the  membership  Status  of  a  member  f<T  a  Specific  group  and 
\’iew. 

get  s-iew  (char  ‘group,  ~) 

Returns  the  cunent  view  scope  of  a  group  as  a  membership  list 

get  fat  (char  ‘pwip,  char  ‘properties,  s-iess  ‘viewid,  _) 

Returns  the  members  of  a  group  with  certain  properties  fiom  a -view . 

2 

get  groups  (»rtlevel,char ‘address, char ‘properties,— ); 

Get  a  list  of  groups  at  a  specified  level. 

get  member  attributes  (char ‘group, char ‘address,-) 

Gets  the  properties  for  the  specified  member. 

get  group  attributes  (char ‘group, -) 

Get  the  properties  for  the  specified  group. 

get  sites  (char ‘group  fat,-) 

Get  a  list  of  sites  on  which  members  of  specified  groups  east 

get _g5ites  (char  ‘group,  char  ‘member_attributes,  -) 

Get  a  list  of  gjTiup  sites  cn  uhich  the  Specified  grotf)  has  members 

with  specified  attributes. 

3 

send_m.sg  (char  ‘group,  char  ‘memberjittributes,  -) 

Send  a  messge  to  members  of  a  groip  with  given  attributes. 

join_j;rDup  (char  ‘group_iiame,  char  ‘my_address,  -) 

Jrans  a  group  if  it  exists,  dse  creates  cne.  Permits  a  new  incamaticti 
to  join  as  an  independent  member. 

leave _group  (char  ‘groupjiame,  char  ‘address,  -) 

Forces  a  member  with  given  address  to  leavePamits  leaving 
gn-ups  with  certain  attributes. 

Index  for  Categories:  l:Group  management,  2:  Group  information,  3:  Communication,  4:  Membership 


The  use  of  timeouts  and  retries  on  change-processing  messages  creates  a  secondary  but  essential 
method  of  detecting  mserver  failures.  Since  mserver  monitoring  uses  unicast  messages  and  change¬ 
processing  uses  multicasts,  it  is  possible  that  a  network  partition  could  occur  that  affected  only  multicast 
message  delivery  between  one  or  more  mservers.  The  inability  of  mservers  to  communicate  all  necessary 
data  creates  a  virtual  partition  between  the  mservers.  Without  the  use  of  this  secondary  detection  method,  it 
is  possible  that  one  or  more  mservers  could  be  functioning  perfectly  well,  sending  the  required  monitoring 
messages,  but  unable  to  respond  to  change-processing  messages,  thus  creating  a  deadlock  situation.  The 
timeout  and  retries-  on  change-processing  messages  ensures  that  an  mserver  unable  to  communicate  will  be 
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detected  failed,  and  the  remaining  mservers  will  be  able  to  complete  the  change  in  a  timely  manner.  In  the 
event  of  a  coordinator  failure  during  the  change  processing,  a  distributed  election  is  conducted  and  a  new 
coordinator  is  elected  to  continue  the  original  change.  This  is  described  later. 

1.  Types  Of  Changes 

There  are  three  primary  types  of  membership  changes  processed  by  a  group  of  mservers:  requests, 
failures,  and  dynamic  reconfigurations. 

a.  Requests 

Requests  are  voluntary,  planned  membership  changes,  submitted  to  the  group  for  process¬ 
ing  by  an  application  process  or  system  administrator.  Change  requests  for  the  MS  physical  hierarchy  may 
be  to  Join  to  a  core  set,  Lea\<e  a  core  set,  Split  a  core  set  to  form  two  new  ones,  Merge  two  core  sets  to 
form  one.  Add jxtrent  to  add  a  parent  mserver  to  the  core  set  and  Del jxireni  to  remove  a  parent  from  the 
core  set.  Physical  change  requests  are  multicast  to  a  specific  group  in  the  hierarchy  by  a  system  configura¬ 
tion  call,  usually  invoked  by  a  system  administrator  during  manual  configuration  of  the  MS  hierarchy. 
Application  group  change  requests  are  submitted  to  the  resident  MI  process  on  the  host  computer  by  the  ap¬ 
plication  user.  The  MI  then  propagates  the  request  to  the  group  mserver  above  it  in  the  hierarchy.  The  re¬ 
ceiving  group  mserver  queues  the  request  to  be  processed  when  other  higher  priority  changes  have 
completed  processing. 

b.  Failures 

The  second  primary  type  of  membership  changes  are  detected  failures.  These  detected 
failures  may  be  the  result  of  the  actual  failure  of  an  mserver,  MI,  or  application  process,  or  the  host  machines 
upon  which  they  are  running.  Additionally,  network  partitions  will  be  perceived  as  failures  of  the  partitioned 
mservers,  and  will  lead  to  the  processing  of  failures  and  reformation  of  the  partitioned  subsets  of  mservers 
and  subgroups  of  application  processes  The  partitioning  of  the  MS  physical  hierarchy  leads  to  a  partitioning 
of  the  application  groups  residing  on  this  hierarchy.  The  MS  automatically  reforms  both  the  physical  hierar¬ 
chy  and  the  supported  application  groups  in  the  event  of  a  network  partition  Failures  detected  or  received 
by  a  group  mserver  are  queued  and  processed  according  to  their  priority.  Multiple  failures  queued  at  a 
group  mserver  are  processed  all  at  once,  in  a  "batched"  manner.  This  greatly  reduces  the  time  required  to 
reform  physical  core-sets  or  application  groups. 

c.  Dynamic  Reconfigurations 

The  final  type  of  changes  are  the  result  of  automatic  actions  taken  by  core-sets  of  mserv¬ 
ers.  This  type  of  dynamic  reconfiguration  occurs  when  new  members  join  an  application  group,  causing  the 
span  of  the  application  group  to  increase  beyond  that  presently  covered  by  the  current  application  core-set. 
In  this  event,  the  application  core-set  must  be  moved  fi'om  the  present  level  in  the  physical  hierarchy  to  a 
higher  level  covering  the  new  span  of  the  application.  This  new  level  must  be  at  or  below  the  name-scope 
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and  membership-scope  levels  of  the  application  group,  if  these  levels  were  designated  when  the  application 
group  was  created.  The  MS  automatically  moves  the  application  core-set  to  the  new  level.  In  a  similar 
manner,  the  departure  of  application  member  processes  may  lead  to  a  reduced  span  of  the  application.  An 
application  core-set  must  have  at  least  two  mservers  with  application  members  in  their  subtrees;  otherwise, 
there  is  no  need  to  have  the  application  core-set  at  this  level  in  the  hierarchy.  If  the  application  core-set  is 
reduced  to  only  one  mserver  supporting  an  application,  the  application  core-set  will  automatically  move 
down  to  the  child-set  of  this  mserver. 

The  repositioning  of  an  application  core-set  is  initiated  by  the  set  of  mservers  detecting  the 
need  to  move  the  application  core-set.  Messages  are  exchanged  between  the  old  and  new  core-sets  and  a 
change  involving  the  join  or  departure  of  the  instigating  application  member  is  processed  along  with  the 
change  in  application  core-set  level  by  both  core-sets.  After  committing  the  changes,  the  internal  state  of  all 
mservers  in  both  core-sets  is  changed  to  reflect  the  new  application  core-set  level. 

2.  Ordering  And  Priority  Of  Change  Processing 

A  key  issue  associated  with  processing  membership  changes  is  the  ordering  of  changes  committed 
by  the  core-set.  As  previously  described,  to  guarantee  consistent  ordering  of  membership  changes  at  all 
mservers  in  the  group,  only  one  change  may  be  committed  at  a  time.  However,  it  is  possible  that  more  than 
one  membership  change  may  be  submitted  to  or  detected  by  the  group  at  one  time.  Each  receiving  or  de¬ 
tecting  mserver  in  the  group  will  attempt  to  become  the  group  coordinator  and  initiate  the  change  it  received 
or  detected.  These  multiple  change  initiation  attempts  are  referred  to  as  "virtually  simultaneous",  since  they 
have  all  been  initiated  before  the  group  has  reached  a  consistent  and  uniform  decision  on  the  current  change 
to  process. 

To  resolve  these  virtually  simultaneous  changes  and  select  only  one  change  to  be  processed,  a 
priority  scheme  is  used.  This  scheme  uses  the  type  of  change  and  the  unique  group  id  (rank)  of  the  subject  of 
the  change  to  decide  which  change  will  be  processed  by  the  group.  The  subject  of  a  change  refers  to  the 
member  whose  membership  status  has  changed.  The  highest  priority  is  given  to  any  current  change  being 
processed  by  the  group;  that  is,  a  change  which  is  in  progress  at  an  mserver  (i.e.  an  ack  to  the  initiate  has 
been  sent).  It  is  essential  that  such  a  change  progresses  to  completion  at  all  group  mservers;  otherwise,  the 
possibility  of  inconsistent  membership  views  exists  if  some  mservers  commit  the  change  while  others  do  not. 
The  next  lower  priority  is  that  physical  hierarchy  changes  always  have  priority  over  application  group 
changes.  This  is  because  it  is  important  to  ensure  a  complete  and  whole  MS  infrastructure  before  attempting 
to  change  the  membership  of  an  application  group  using  the  MS.  Once  these  decisions  have  been  made,  the 
priority  of  the  change  is  determined  by  the  rank  or  age  of  the  subject  of  the  change  in  the  group.  The  only 
exceptions  to  this  rule  are  for  the  failure  of  the  coordinator  of  the  current  change  or  a  Join.  The  failure  of 
the  coordinator  of  a  change  in  progress,  has  priority  over  otherwise  equal  status  changes.  A  newly  joining 


13 


mserver  or  member  will  not  have  an  associated  rank  until  after  the  join  is  completed.  For  this  reason,  the 
network  address  of  the  joining  member  is  used  instead  of  a  rank  number  to  decide  priority  among  Joins. 
The  final  rule  used  to  determine  the  priority  of  virtually  simultaneous  changes  is  applicable  when  changes  are 
submitted  to  the  core-set  by  different  application  groups  with  identical  subject  rankings  in  each  group.  In 
this  case,  a  tie-breaker  is  needed,  and  the  ranks  of  the  coordinators  in  the  group  are  used  to  decide  which 
change  will  be  processed. 

A.  CHANGE  PROTOCOL 

The  basic  change-processing  protocol  consists  of  two  phases:  the  Initiate  and  Commit  phases.  A 
timeline  for  this  protocol  is  shown  in  Figure  5.  In  the  Initiate  phase,  the  coordinator  multicasts  an  Initiate 
message  to  all  mservers  in  the  group.  The  group  mservers  respond  with  ACKs,  acknowledging  reception  of 
the  Initiate  message.  When  the  coordinator  has  received  all  the  acknowledgments,  the  second  phase  of  the 
protocol  begins  with  the  coordinator  sending  the  Commit  message.  This  message  indicates  to  members  of 
the  group  that  is  safe  to  commit  the  change.  Phase  I  is  achieved  through  a  reliable  schema.  The  coordinator 
sends  the  Initiate  message  a  predetermined  number  of  times  if  one  or  more  group  mservers  do  not  reply.  Af¬ 
ter  that,  it  assumes  that  the  mserver(s)  that  did  not  reply  have  failed. 

B.  CHANGE  PROTOCOL  WHEN  COORDINATOR  FAILS 

The  two  phase  protocol  is  not  sufficient  in  case  coordinator  of  a  change  fails  while  processing  a 
change.  As  shown  by  Riccardi  and  Birman  [9],  a  three  phase  protocol  is  required.  After  the  coordinator  of 
the  current  change  fails,  and  its  failure  is  detected  by  a  member  of  the  group,  a  three  phase  protocol  is  initi¬ 
ated,  with  the  election  of  the  new  coordinator  as  the  first  phase.  Figure  6(a)  illustrates  the  three  phase  elec¬ 
tion  and  change  processing  protocol.  In  the  election  phase  only  those  members  that  have  finished  phase  one 
of  the  original  change  protocol  participate. 


Phase  1  Phase  11 
Initiate  Commit 


coordinator 

core-set 

mservers 


^  4'  >  > 

_ 


time 


Figure  5:  Basic  Two  Phase  Change  Protocol 


When  the  new  coordinator  is  elected,  it  knows  the  status  of  each  member  with  respect  to  the 
change,  due  to  a  status  broadcast  during  the  election  phase.  If  at  least  one  mserver  has  finished  the  change 
(committed)  it  means  that  the  old  (and  detected  failed)  coordinator  had  already  collected  all  ACKs  and  had 
started  the  Commit  phase.  So  the  new  coordinator,  knowing  that  phase  one  was  completed,  can  continue 


14 


with  the  final  phase  of  the  change,  instead  of  restarting  phase  one.  This  compressed  three  phase  protocol  is 
shown  in  Figure  6(b). 


detector 

core-set 

mservers 


Phase  I  Phase  II  Phase  III 

Election  Initiate  Commit 


Phase  1  Phase  II 

Election  Commit 


(a)  (b) 

Figure  6:  Three  Phase  (Election  and  Change)  and  Compressed  Three  Phase  Protocol 


A  simplified  algorithm  for  the  two  phase  and  three  phase  algorithm  is  shown  in  Figure  7.  Only  the 
important  arguments  are  shown.  All  sends  and  receives  of  messages  are  done  with  timeouts.  This  way  the 
protocol  does  not  block  and  takes  appropriate  actions  in  case  of  no  response.  In  line  5,  the  term  "reliably" 
implies  that  a  message  is  sent  and  specific  answers  are  expected  from  all  members  in  a  specific  time  interval. 
If  some  or  all  of  them  do  not  reply,  a  number  of  retries  are  attempted.  After  the  retries  and  the  last  timeout 
expire,  the  sender  assumes  the  non  responding  mservers  to  have  failed  In  line  1 1,  the  same  frinction  is  called 
recursively  to  process  the  failure(s)  of  the  member(s)  that  did  not  reply.  In  line  8,  the  second  phase  is  ex¬ 
ecuted  by  sending  a  commit  message.  Then  the  coordinator  has  to  make  the  change  to  its  internal  state. 
Lines  13  through  25  are  devoted  to  the  non  coordinator.  Since  this  code  is  executed  everywhere,  it  contains 
both  cases  (coordinator,  non-coordinator),  and  the  if  statement  in  line  3  decides  which  part  of  the  code 
should  execute. 

Lines  17  through  25  form  the  three  phase  protocol  in  case  the  coordinator  of  the  current  change 
fails.  Lines  17  through  21  refer  to  the  member(s)  that  detected  the  failure  of  the  current  coordinator.  Lines 
22  through  25  refer  to  the  member(s)  still  not  detect  the  coordinator's  failure  but  wait  for  a  Commit  mes¬ 
sage.  Both  sides  triggered  by  the  coord  Jail  message,  start  collecting  status  of  the  rest  of  the  members. 
Then  in  lines  20  and  24  an  election  of  a  new  coordinator  is  done.  The  lowest  rank  among  the  survived  and 
responded  mservers  gets  elected  and  all  members  go  to  process  the  new  (and  of  higher  priority)  change.  If 
the  current  coordinator  does  not  fail,  all  members  commit  the  change,  updating  their  internal  state,  in  line  27. 


C.  PARTITION  RESOLUTION  PROTOCOL 

As  part  of  the  processing  of  multiple  failures  caused  by  a  network  partition,  a  group  is  often  parti¬ 
tioned  into  two  or  more  subsets.  After  the  membership  changes  in  each  subset  stabilize,  these  sub-core-sets 
attempt  to  reform  into  the  original  group  by  sending  messages  to  the  other  subsets  of  mservers.  Since  the 
sub-core-sets  still  share  the  same  multicast  address,  once  the  network  partition  is  mended,  the  other  sub- 
core-sets  receive  these  reformation  messages.  Upon  learning  of  the  existence  of  a  sub-group  from  the  origi¬ 
nal  group,  the  partitioned  subsets  of  mservers  reform  into  the  original  group  automatically.  In  addition  to 
reforming  the  physical  group,  all  application  groups  which  were  partitioned  and  are  still  functioning  are  also 
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reformed.  The  reformation  process  for  both  physical  core-sets  and  application  groups  merges  the  currently 
existing  membership  of  each,  taking  the  union  of  all  subsets  or  subgroups,  and  making  the  reformed  group 
or  application  group  membership  the  current  view.  In  the  event  that  the  network  partition  is  not  repaired  in 
a  predetermined  period  of  time,  the  partitioned  subsets  of  mservers  will  abandon  their  attempts  to  reform  the 

original  group,  and  will  create  a  new  multicast  group  with  only  the  current  group  mserver  included. 

1.  /‘mernbeshipch^)ge  protocol  V 

2.  processjchange  (type  of  change,  sitijech 

3.  ifcoordnator 

4.  /‘start  phase  I*/ 

5.  send  agree  to  yonsrelably 

6.  receive  acks 

7.  put  members  that  dW  not  reply,  on  fai  1st 

8.  send  commit  message 

9.  commit  the  change 

10.  iffaiktisnotempty 

11.  process_change(fal,fistfifaiist) 

12. 

13.  else  /*non-a)Ofcfnafcir*/ 

14.  receive  agee  msg 

15.  sendacktDOOoncfrratof 

16.  wait  for  commit 

17.  if  commit  is  rwt  received 

1 8.  send  coord_fal  msg  broadcasfing  status 

19.  colect  status  from  other  members 

20.  determine  new  coorcfriator 

21 .  pfDcess_change  (fal,  old  coordnator) 

22.  else  if  coord  fal  msg  is  received 

23.  broadcast  own  status  a»xl  colect  status  from  other  members 

24.  determine  new  coordhator 

25.  process_change  (fai,  old  coordhator) 

26.  else  /"  commit  received '/ 

27.  commit  change 


Figure  7:  The  Membership  change  protocol  (two  phase  -  three  phase) 

If  the  group  partitions,  the  application  groups  that  span  the  partition,  also  experience  a  virtual  parti¬ 
tion.  These  partitions  are  handled  using  the  following  two  rules. 

•  Keep  alive  any  partitioned  subgroups  that  mee*  a  certain  condition  specified  by  the  user.  Any  subgroups 
vvfiich  do  not  meet  the  condition  are  terminated. 

•  Partitioned  subgroups  attempt  to  find  and  merge  with  other  partitioned  subgroups  that  have  a  certain 
user-specified  property. 

By  combining  these  two  rules,  every  possible  combination  of  partition  handling  methods  can  be 
produced.  The  first  rule  determines  who  survives,  and  the  second  rule  determines  who  will  attempt  to 
merge.  Each  rule  can  also  combine  multiple  parameters  to  provide  very  specific  and  flexible  methods  of 
handling  partitions.  For  example,  all  subgroups  larger  than  a  size  of  three  which  contain  a  particular  member 
type  could  be  permitted  to  survive  and  merge  with  subgroups  larger  than  half  of  the  original  group  size  and 
containing  another  particular  member  type.  Note  that  all  partitions  of  the  group  necessarily  survive  network 
partitions. 
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In  the  event  that  the  partitions  of  mservers  are  unable  to  restore  communications,  the  reformed  sub¬ 
sets  are  converted  to  completely  independent  core-sets.  Since  all  core-sets  of  mservers  must  have  a  unique 
name  and  multicast  address,  some  method  must  be  used  to  automatically  obtain  these  unique  values.  To  ob¬ 
tain  a  unique  name,  each  sub-group  appends  a  unique  suffix  to  the  original  group  name.  This  suffix  value 
must  be  automatically  derived  by  each  partitioned  subset  of  mservers  independently,  and  with  a  guaranteed 
unique  value  for  all  partitioned  subsets.  The  most  readily  available  attribute  that  all  subsets  can  use  to  obtain 
a  guaranteed  unique  name  is  the  original  group  identity  (gid)  of  a  significant  mserver  remaining  in  each  parti¬ 
tion.  The  lowest  mserver  gid  of  the  mservers  remaining  in  each  partition  is  appended  to  the  original  group 
name.  In  this  manner,  all  partitioned  core-sets  are  guaranteed  a  unique  group  name.  However,  all  parti¬ 
tioned  core-sets  are  still  easily  identifiable  as  subsets  of  the  original  group,  which  simplifies  the  task  of 
manually  reconfiguring  the  physical  hierarchy  when  the  network  is  repaired.  Once  a  unique  name  is  obtained, 
traffic  on  the  same  multicast  address  can  be  easily  filtered  by  the  individual  sub-core-sets. 

rV.  IMPLEMENTATION 

The  use  of  multicast  message  delivery  is  essential  to  the  efficient  and  scaleable  operation  of  an  MS. 
In  network  environments  where  IP  multicast  is  not  available  [7],  the  MS  must  use  point-to-point  messages. 
So  a  multicast  emulator  was  implemented  to  provide  with  the  same  interface  to  mservers  running  either  in 
an  IP-multicast-capable  or  a  unicast  environment.  Figure  8  shows  how  the  mcaster  is  embedded  in  the  com¬ 
munications  among  mservers  with  IP  multicast  capability  and  unicast  only  capability.  Each  mserver  has  two 
sockets.  One  of  them  is  unicast  dedicated  to  monitoring  and  the  other  can  be  either  a  multicast  or  a  unicast  if 
IP  multicast  is  not  available  on  the  LAN.  If  this  socket  is  multicast  mserver  uses  pure  IP  multicasting  to 
send  and  receive  group  messages.  If  it  is  unicast,  the  mserver  communicates  with  the  group  through  mcaster. 
Figure  8  shows  the  two  cases  where  the  sender  is  on  a  multicast  and  on  a  unicast  capable  host. 


:  mcaster  specific  message 

Figure  8:  Communication  Among  Mservers 
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The  2-phase  and  three-phase  membership  change  protocol  described  earlier,  has  been  implemented 
and  tested  over  a  campus  WAN  of  IP-multicast  and  non-multicast  LANs. 

V.  CORRECTNESS  ARGUMENTS 

In  this  section  we  present  the  correctness  arguments  to  show  that  the  structure  of  the  protocol  and 
the  actions  it  takes  at  each  mserver  lead  to  the  desired  attributes  of  the  service  itself 

The  focus  of  the  argument  is  on  a  core  set  of  mservers  as  a  group  and  refer  to  individual  mservers 
as  members  of  the  group.  It  is  shown  that  changes  to  the  membership  of  this  group  are  made  to  in  order  to 
maintain  consistency  of  the  membership  information.  Application  group  membership  consistency  can  be  ar¬ 
gued  about  in  an  identical  manner.  The  assumptions  made  and  the  terms  used  and  their  specific  implications 
in  the  correctness  arguments  are  described  first.  This  is  followed  by  the  correctness  criteria  and  a  summary 
of  the  actions  that  the  protocol  takes  to  maintain  the  membership  knowledge  accordingly.  Finally,  key  state¬ 
ments  about  different  aspects  of  the  protocol  are  proved. 

A.  ASSUMPTIONS 

As  described  previously,  an  asynchronous  communication  environment  is  assumed  to  exist,  provid¬ 
ing  an  unreliable  message  delivery  capability  with  an  unbounded  delay,  as  in  the  present  best-effort  Internet. 
Thus,  network  failures  that  include  dropped  messages  and  network  partitions  are  permitted.  All  member  fail¬ 
ures  are  assumed  to  be  crash  or  fail-stop  [5,  9,  10,  11].  In  such  conditions,  failures  can  only  be  perceived, 
and  both  actual  member  failures  and  network  partitions  lead  to  perceived  failures  of  the  members.  For  this 
reason,  every  perceived  failure  is  processed  as  an  event  that  partitions  the  group.  Partitions  of  the  member¬ 
ship  of  a  group  are  assumed  to  be  acceptable  to  the  user  of  the  membership  service,  who  may  make  QoS 
selections  to  determine  how  partitioned  groups  will  continue  to  function,  as  described  in  earlier  chapters. 
Unlike  many  other  membership  protocols,  majority-based  decisions  are  not  used  by  this  MS  to  ensure  that 
only  a  single  partition  suivaves;  instead,  complete  agreement  is  required  among  all  surviving  members  that 
can  communicate,  leading  to  the  possibility  of  separate,  functioning  partitions  of  any  size.  Continuous 
changes  to  the  membership  are  allowed,  however,  the  changes  are  committed  one  at  a  time,  and  with  a  spe¬ 
cific  order  in  each  partition. 

B.  TERMS  AND  DEFINITIONS 

The  specific  terms  and  implications  of  their  use  in  the  correctness  arguments  described  later  are 
listed  below. 

1.  Change  Events 

The  events  that  cause  a  change  in  the  membership  are:  explicit  join  and  leave  requests  by  members, 
perception  of  failure  by  the  monitoring  of  members  by  other  members,  and  suspicion  of  failures  resulting 
from  member  or  network  failures  which  lead  to  a  lack  of  response  during  change  processing. 


18 


2.  Change  Event  Priority 

Every  change  event  has  an  associated  priority  to  enable  ordering  of  virtually  simultaneous  changes. 
Failures  have  a  higher  priority  than  voluntary  joins  or  departures.  Priority  of  a  failure  or  departure  event  is 
the  rank,  or  seniority,  of  the  failed  member  in  the  group.  The  most  senior  member  always  has  a  rank  of  0. 
When  two  or  more  members  initiate  a  change  simultaneously,  the  coordinator  initiating  the  higher  priority 
change,  as  determined  by  the  rank  of  the  subject  of  the  change,  prevails.  In  virtually  simultaneous  joins,  the 
subjects  do  not  yet  have  a  group  rank,  so  the  network  address  of  each  subject  is  used  in  place  of  the  rank. 
The  subject  with  the  lower  network  address  will  be  interpreted  as  having  a  higher  temporary  rank,  and  there¬ 
fore  will  have  a  higher  priority,  joining  the  group  first. 

3.  Isolation 

A  member  that  perceives  another  member  as  faulty  ceases  all  communication  with  that  faulty  mem¬ 
ber.  This  leads  to  the  member  perceived  as  faulty  also  determining  that  the  other  member  is  faulty,  since  no 
communications  are  received. 

4.  Gossip 

A  member  that  isolates  another  member  gossips  about  the  isolation  in  the  subsequent  communica¬ 
tion  it  has  with  every  other  group  member  Thus,  in  the  absence  of  any  other  failures,  a  multicast  following 
an  isolation  leads  to  the  whole  group  isolating  the  member  that  was  perceived  faulty  by  the  sender  of  the 
multicast. 

5.  Group  View 

This  term  denotes  the  ordered  membership  list  maintained  by  each  member  m,,  and  is  denoted  as 
View^.(m_),  where  x  denotes  the  view  number. 

a.  Definition 

The  group  view  at  a  member  is  the  set  of  members  that  are  perceived  to  be  part  of  the 
group.  It  is  ordered  with  respect  to  the  seniority  of  members  in  the  group  and  has  an  integer,  called  a  view 
number,  associated  with  it. 

b.  Remarks 

Every  membership  change  alters  the  number  of  members  in  the  view  at  a  member  and 
leads  to  the  installation  of  a  new  view  identified  by  the  next  higher  view  number.  The  number  of  members  in 
the  group  may  change  by  more  than  one  in  a  single  view  change  The  rank  of  a  member  denotes  its  seniority 
in  the  group,  with  the  most  senior  member  having  rank  0.  Identical  views  imply  identical  membership  as 
well  as  ranks. 
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6.  Group  Partition 

Let  G  denote  the  set  of  all  possible  potential  and  current  members  of  a  group.  A  partition  P  of  G  is 
defined  below. 

a.  Definition 

P  is  a  subset  of  the  all  members’  set  G,  such  that  V  /w,,  €  P,  if  View^(ff;,)  and  View^/n,) 
are  defined,  then  €  View^(m.):  €  View  fin)  ^  ^  View/zw^),  and  all  members  have  the  same 

rank  in  the  two  views. 

b.  Remarks 

The  current  view  associated  with  partition  P  is  denoted  Viewp,  and  the  partition  contain¬ 
ing  m.  is  denoted  P(m).  Thus,  all  members  in  a  partition  must  have  identical  views.  However,  it  is  possible 
that  there  exists  an  outside  a  partition,  but  still  in  every  member's  view  for  that  partition.  Such  partitions 
are  called  unstable  partitions.  The  MS  protocol  treats  such  a  partition  as  legal,  and  eventually  removes 
from  the  views  of  all  members  of  the  partition.  When  no  such  exists  for  a  partition,  the  partition  is  called 
stable.  Both,  perceived  network  and  member  failures,  lead  to  the  creation  of  group  partitions  in  asynchro¬ 
nous  environments. 

7.  Group  Membership  Protocol 

Using  the  definitions  of  the  terms  above,  a  protocol  that  solves  the  Group  Membership  Problem 
(GMP)  is  defined  as  below. 

a.  Definition 

A  protocol  solves  the  GMP  correctly  if  every  change  event  results  in  group  partitions 

eventually. 

b.  Remarks 

The  above  definition  of  a  correct  solution  of  the  GMP  requires  it  to  satisfy  distinct  proper¬ 
ties  corresponding  to  the  underlined  conditions  in  the  definition  above. 

•  El  This  property,  arising  from  the  condition  of  every,  requires  that  a  change  event  observed  by  a  member 
is  processed  despite  other  virtually  simultaneous  change  events  and  failures  during  protocol  execution, 
including  that  of  the  coordinator.  The  only  situation  in  which  a  change  event  is  not  processed  is  in  case  of 
catastrophic  occurrences  in  which  all  the  members  with  knowledge  of  the  change  event  suffer  real  feilures. 

•  E2  This  property,  arising  from  the  condition  of  eventually,  permits  the  proces^  of  a  change  evart  to  be 
suspended  temporarily,  however,  it  requires  that  the  resulting  view  is  always  installed  at  all  members  of  the 
partition  before  the  change  event  occurred  and  after  only  a  finite  number  of  changes  are  allowed  to  take 
place. 

•  GP  This  property,  arising  from  the  condition  of  group  partitions,  implies  identical  views  at  all  members  of 
each  partition.  As  per  the  protocol  described,  all  partitions  resulting  from  change  processing  always 
become  stable. 
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Requirements  imposed  by  the  El  and  E2  properties  satisfy  the  condition  commonly  known  as  live- 
mss  in  distributed  systems  and  those  imposed  by  the  GP  property  satisfy  safety  [5,  24,  25],  Thus,  the 
uniqueness  of  views  and  identical  ordering  of  changes  at  all  operational  members  is  guaranteed  by  GP. 

C.  REMARKS  ON  THE  PROTOCOL  STRUCTURE 

Unless  specified  otherwise,  the  term  failure  is  assumed  to  imply  a  perceived  member  failure  that 
may  have  been  caused  by  either  a  network  failure  or  a  member's  failure.  Any  of  the  members  may  initiate  a 
change  when  it  perceives  a  change  according  to  the  change  events  described  earlier.  The  change  initiator  is 
called  the  coordinator  for  that  change  and  carries  out  the  membership  change  protocol  of  Figure  7.  The 
non-coordinator's  actions  consist  of  sending  the  ACK  message  and  committing  the  change.  Due  to  the  possi¬ 
bility  of  other  changes  occurring  during  a  change  processing,  both  the  coordinator  and  non-coordinators 
must  take  additional  actions.  Depending  upon  the  message  received  by  the  coordinator  as  it  collects  the 
ACKs,  it  switches  to  a  higher  priority  change,  enters  an  election,  or  delays  the  change  it  is  coordinating  due 
to  a  previous  change  that  may  not  yet  have  completed. 

D,  CORRECTNESS  ARGUMENTS 

Based  on  the  protocol  detailed  description  given  earlier,  a  proof  is  presented  that  shows  that  the 
MS  protocol  has  all  the  properties  as  identified  above  for  a  correct  solution  to  the  GMP.  Also  shown  is  that 
a  more  refined  solution  to  the  GMP  defined  earlier  by  Ricciardi  and  Birman  [9]  is  possible, 

1.  Claim  1 

Change  event  processing  always  compleies  at  both  the  coordinator  and  the  non-coordinator  ex¬ 
cept  when  all  members,  including  the  coordinator,  with  Icnowledge  of  the  change  fail. 

2.  Proof 

Consider  a  change  event  change(suhject,  coordinator)  initiated  in  P. 

a.  At  The  Coordinator 

Although  the  coordinator  makes  multiple  attempts  to  deliver  the  Initiate  message  to  all 
perceived  members  of  P,  it  does  not  require  a  predetermined  number  of  them  to  respond  before  it  sends  a 
Commit  message.  If  the  coordinator  switches  to  a  higher  priority  change  before  it  sends  a  Commit  message, 
the  information  about  the  old  change  is  saved.  The  old  change  is  reinitiated  by  the  coordinator  after  all  high¬ 
er  priority  changes  complete. 

b.  At  The  Non-Coordinator 

The  non-coordinators  waiting  for  the  Commit  of  the  lower  priority  change,  switch  to  the 
higher  priority  change,  when  the  corresponding  Initiate  is  received. 

Ifthe  coordinator  fails,  at  least  one  member  times  out  on  the  Commit  message  and  starts 
an  election.  The  highest  rank  member  with  the  change  active  is  elected  to  resume  the  change.  The  fact  that 
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the  election  is  conducted  among  those  with  knowledge  of  the  change  ensures  that  the  change  completes 
even  if  the  coordinator  and  the  only  members  to  have  committed  the  change  fail.  This  takes  care  of  the  in¬ 
visible  commits  described  by  Ricciardi  and  Birman  [9], 

3.  Claim  2 

In  any  partition,  either  only  one  change  event  proceeds  to  the  commit  phase,  or  members  reaching 
the  commit  phase  for  different  change  events  form  separate  partitions. 

4.  Proof 

Initially,  all  members  have  identical  views  of  the  membership  (definition  of  a  partition).  In  the  set 
of  all  potential  change  events,  there  exists  a  unique  priority  order  due  to  the  uniqueness  of  ranks,  which  or¬ 
der  failures  and  departures,  and  network-level  addresses,  which  order  joins.  This  permits  every  member  re¬ 
ceiving  multiple  Initiate  messages  before  receiving  any  Commit  message  to  switch  to  the  highest  priority 
change  that  will  install  the  next  view.  Overlapping  of  Initiate  messages  to  install  successive  views  with  dif¬ 
ferent  view  numbers  is  not  possible,  because  every  member  accepts  only  the  highest  priority  Initiate  to  es¬ 
tablish  the  current  change. 

Suppose  a  member  receives  a  Commit  message  for  the  current  change  that  will  change  the  view 
number  from  xtox^I.  Suppose  this  mserver  then  receives  a  higher  priority  change  that  also  corresponds  to 
a  view  number  change  from  x  to  x+I.  It  is  guaranteed  that  the  coordinator  sending  the  higher  priority  Initi¬ 
ate  appears  in  the  gossip  accompanying  the  received  Commit  message.  This  happens  because  the  coordina¬ 
tor  of  the  lower  priority  change  will  have  timed  out  on  the  coordinator  for  the  higher  priority  change  and 
isolated  it  before  generating  a  Commit  message.  This  ensures  further  partitioning. 

5.  Claim  3 

If  the  coordinator  fails  after  sending  the  commit  message,  a  two-phase  protocol  consisting  of  an 
election  followed  by  a  commit  can  solve  the  group  membership  problem  correctly. 

6.  Proof 

Begin  by  proving  the  contrapositive  statement: 

A  t\t’o-phase  protocol  consisting  of  an  election  followed  by  a  commit  cannot  solve  the  GMP  cor¬ 
rectly  if  the  coordinator  fails  before  sending  the  commit. 

If  the  coordinator  fails  before  sending  the  Commit  message,  it  is  possible  that  one  of  the  members 
has  not  yet  received  the  Initiate  message  for  the  change.  This  member  would  respond  in  the  election  with  a 
Coord-Fail  message  that  announces  that  it  is  not  aware  of  the  change  for  which  the  election  has  been 
started.  This  member  must  receive  an  Initiate  message  before  it  can  commit  the  change  for  which  the 
coordinator  failed.  If  the  Coord-Fail  message  is  used  to  start  the  change  in  place  of  a  separate  Initiate 
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message,  and  only  a  Commit  message  is  sent  to  complete  the  change,  then  the  GP  property  can  be  violated, 
as  shown  in  the  example  below. 

Consider  a  partition  consisting  of  members  m.,  m^  C^,  and  Q.  Let  Q  initiate  change  "a”  by 
multicasting  Initiate^,  which  is  received  only  by  m^  due  to  network  failures.  fails  immediately  after  send¬ 
ing  Initiate^  and  this  failure  is  perceived  by  m^,  which  then  starts  an  election  by  multicasting  Coord  Fail^. 
mj  and  participate  in  the  election,  but  Q  does  not  because  it  has  failed.  However,  before  failing,  Q  starts 
another  higher  priority  change  by  multicasting  Initiate^,  which  reaches  only  m^  due  to  network  failures. 
Since  change  ”b"  is  a  higher  priority  change,  m^  drops  change  "a”  as  the  current  change.  At  this  point,  m^ 
perceives  Q  failed  and  starts  an  election  by  multicasting  Coord  Fail^. 

Throughout  this  time,  m^  waits  to  hear  Q's  response  to  the  election  for  change  "a",  which  wiU  not 
arrive  due  to  Q's  failure  before  Coord  Fail ^  reaches  it.  Eventually,  m,  times  out  in  the  election,  determines 
that  it  must  be  the  winner,  and  assumes  the  responsibility  for  completing  change  "a".  Using  the  compressed 
3-phase  protocol  of  Figure  6(b),  m,  skips  the  Initiate^  and  commits  change  "d“  and  multicasts  Commit^  to 
the  group  with  gossip  about  Q's  isolation.  If  the  Commit^  reaches  m  and  after  they  have  switched  to 
change  "b"  due  to  the  Coord  Fail^,  message,  they  will  quietly  discard  the  Commit^  message  due  to  its  lower 
priority.  Thus,  m^  will  have  committed  change  "a”,  whereas  m^  and  m^.  will  never  commit  it.  This  inconsis¬ 
tency  violates  the  GP  property  and  makes  the  use  of  a  two-phase  protocol  incorrect  in  this  situation.  Thus, 
the  contrapositive  statement  is  proved.  This  will  not  happen  if  m,  used  all  the  three  phases. 

The  contrapositive  statement  proves  Claim  3  above.  It  should  be  noted  that  the  failure  of  the  coor¬ 
dinator  after  sending  the  Commit  message  with  simultaneous  failures  of  all  members  that  receive  the  Commit 
message  is  equivalent  to  the  coordinator  failing  before  sending  the  Commit  message.  It  is  not  possible  to 
differentiate  between  these  two  situations,  thus  the  change  must  be  completed  in  three  phases.  In  the  proto¬ 
col  described  in  this  report,  the  three  phases  are  the  broadcast  election,  initiate,  and  commit  phases.  Thus, 
the  elected  coordinator  is  required  to  complete  the  change  with  a  Commit  message  if  some  member  that  had 
committed  the  change  participates  in  the  election,  permitting  a  two-phase  processing  of  the  coordinator  fail¬ 
ure.  Otherwise,  the  elected  coordinator  simply  reinitiates  the  change,  providing  a  three-phase  processing  of 
the  original  change. 

7.  Theorem 

The  proposed  group  membership  protocol  is  safe  and  live. 

8.  Proof 

The  liveness  properties  El  and  E2  follow  directly  from  Claim  1.  The  safety  GP  property  follows 
from  Claim  2  and  3. 
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VI  CONCLUSIONS  AND  FUTUREWORK 


A.  CONCLUSIONS 

This  report  presented  a  globally  scaleable,  fully  decentralized  group  membership  service  which  pro¬ 
vides  the  framework  for  distributed  applications  of  any  size  and  distribution.  A  complete  description  of  the 
architectural  design  and  protocol  specifications  was  presented,  and  a  complete  implementation  of  the  basic 
membership  is  described. 

The  most  significant  contribution  of  the  group  membership  service  described  herein  is  the  definition 
of  the  hierarchy  and  associated  functions.  The  MS  provides  a  consistent  suite  of  services  to  client  applica¬ 
tions  distributed  on  any  scale,  from  a  single  LAN  to  the  worldwide  internetwork.  No  other  membership  pro¬ 
tocol  or  service  provides  this  capability.  Other  noteworthy  contributions  include  the  decentralized  and 
efficient  nature  of  the  MS,  the  concept  of  providing  numerous  user-selectable  Quality  of  Service  options  to 
tailor  the  MS  to  the  needs  of  each  client  application  and  a  programmer's  interface  permitting  flexible  access 
to  the  membership  service. 

B.  FUTURE  WORK 

Although  a  great  deal  of  work  was  accomplished  in  the  design  and  protocol  implementation  for  the 
MS  described  in  this  report,  much  more  work  remains  to  be  done.  First,  to  demonstrate  that  the  MS  is  truly 
scaleable  to  global  proportions,  a  working  implementation  of  the  complete  MS  must  be  developed  and  in¬ 
stalled  on  progressively  larger  scales.  Second,  complete  performance  analysis  of  the  operation,  overhead, 
network  constraints,  service  latency,  and  functionality  of  the  MS  must  be  accomplished.  Third,  a  complete 
formal  specification  of  the  protocols  used  by  the  MS  must  be  accomplished,  with  a  reachability  analysis  con¬ 
ducted  to  demonstrate  formally  correct  operation.  Finally,  the  MS  must  be  extended  to  take  advantage  of 
the  reliable,  high-speed  networks  with  newer  network-level  multicasting  options,  which  are  currently  being 
deployed. 
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